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McShane and Wyner [(2011); hereinafter MW11] reiterate a well-known 
and central challenge of paleoclimatology: it is fraught with uncertainties 
and based on noisy observations. Decades of research have aimed at charac- 
terizing these uncertainties and interpreting proxies through laboratory ex- 
periments, field observations, theory, process-based modeling, cross-record 
comparisons, and indeed through statistical modeling and hypothesis test- 
ing. It is against this larger backdrop that the problem addressed by MW11 
must be considered. Attempts to reconstruct global or hemispheric tem- 
perature indices and fields using multi-proxy networks are an outgrowth of 
many efforts in paleoclimatology, but represent relatively recent pursuits in 
the field. They provide neither the principal scientific evidence supporting 
climate-proxy connections, nor the most compelling, and the inference by 
MW11 that their own findings demonstrate a widespread failure in the pre- 
dictive capacity of climate proxies is at odds with most other independent 
lines of proxy research. 

The above considerations notwithstanding, I focus on one principal ar- 
gument by MW11 that uses cross-validation experiments to conclude that 
"proxies are severely limited in their ability to predict average temperatures 
and temperature gradients? I demonstrate that this claim is based on a 
hypothesis test subject to Type II errors and therefore an inconclusive eval- 
uation of the temperature sensitivity of proxy archives. 
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Fig. 1. (a) Cross-validated RMSE on 30-year hold out blocks for a subset of the original 
MW11 experiments (recomputed by the current author) and for the newly completed experi- 
ments using the instrumental data perturbed with various levels of noise. The 86% red-noise 
experiment is considered to be representative of the average proxy predictor [Mann et al. 
(2007)] and compares well to the proxy result, (b) Area-weighted CPS reconstructions us- 
ing the 86% red-noise and 94% white-noise predictor sets. The means of the reconstructed 
time series have been set equal to the mean of the CRU target series during the 1961-1990 
C.E. interval. 

I perform additional cross-validation experiments using 283 time series 
that are randomly selected from the global CRU temperature field as in- 
filled and subselected (1732 total grid cells) by Mann et al. (2008). I choose 
283 samples based on the total number of unique 5° x 5° grid cells that 
contain the 1209 proxies in the Mann et al. (2008) proxy network. Time 
series from these cells are used to create five predictor datasets spanning the 
instrumental period by adding 0, 50, 80 and 94% white noise by variance, 
and 86% red noise by variance (p = 0.32), the latter of which has been ar- 
gued to be an average representation of the noise in proxy records [Mann 
et al. (2007)]. These datasets are used to repeat the MW11 analysis using 
the Lasso to target the CRU NH mean index across 120 cross-validation 
experiments. My results are summarized in Figure 1(a) and compared to 
my own reproduction of some MW11 experiments [see Smerdon (2011) for 
supplementary code and data]. 

I first note that even the no-noise experiment is subject to errors. This is 
an important baseline for the MW11 experiments, demonstrating that even 
"perfect proxies" are subject to errors due to incomplete field sampling. 
Most importantly, however, the skill of the instrumental predictors dimin- 
ishes with noise such that the 86% red-noise and 94% white-noise predictors 
perform comparable to or worse than the proxy network, and in turn per- 
form worse than the ARl(Emp) and Brownian motion null models tested by 
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MW11. Additionally, simple area- weighted composite-plus-scale (CPS) "re- 
constructions" from the Northern Hemisphere (NH) subset of the 86% red- 
noise and 94% white-noise predictor networks yield NH mean indices that 
compare well with the target [Figure 1(b); respective correlations between 
the target and CPS reconstructions are 0.73 and 0.62], indicating that skill- 
ful reconstructions are possible from networks with such noise levels. These 
findings are fundamental to the MW11 argument that the proxies are poor 
temperature recorders because they do not perform better than some noise 
models. To the contrary, the results that I present demonstrate that pre- 
dictor networks explicitly containing temperature signals — perturbed with 
approximate proxy noise levels — also do not beat the ARl(Emp) and Brow- 
nian motion noise models in cross-validation experiments and that skillful 
CPS reconstructions can be derived from such predictors. The appropriate 
conclusion is therefore not that the proxies are limited in their ability to 
predict NH temperatures, but that the test performed by MW11 is subject 
to Type II errors and is unsuitable for measuring the degree to which the 
proxies sample temperature. Note that this conclusion, although challenged 
by MW11, also supports arguments by Ammann and Wahl (2007) about the 
dangers of Type II errors in this paleoclimatic context. 

It is worth considering the likely reasons why the ARl(Emp) and Brown- 
ian motion noise models perform better than predictors explicitly containing 
temperature signals. As discussed by MW11, the large amounts of persis- 
tence in these models are good approximations of a principal characteristic of 
the target time series, namely its temporal autocorrelation. This fact, com- 
bined with the short cross-validation period, allows highly persistent time 
series to test well. But this success is likely also dependent on selections from 
many noise draws. MW11 have focused on the Mann et al. (2008) study that 
includes 1209 total proxies (1138 if the Lutannt series are excluded) and thus 
a large number of possible predictors in the MW11 noise experiments. In 
contrast, other NH temperature reconstructions have been successfully cross 
validated using only a few tens of proxies [e.g., Esper, Cook and Schwein- 
gruber (2002); Moberg et al. (2005); Hegerl et al. (2007)]. It therefore is yet 
unclear how the MW11 cross-validation tests would compare in scenarios 
using far fewer predictors. 

MW11 present a number of experiments that deserve further testing and 
analyses. The issues that I raise similarly come with their own set of caveats 
that cannot be explored in a short discussion paper. For example, a field sam- 
pling reflecting the true Mann et al. (2008) proxy locations with reduced 
ocean sampling and regional clustering might worsen the cross-validation 
skill of the instrumental predictors compared to the random sampling used 
in my experiments. Conversely, the true proxy distribution is more concen- 
trated in the NH, which may improve prediction of the NH mean index. 
I have also sampled each grid cell once, as opposed to multiple sampling 
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reflecting the occurrence of several proxies in a single grid cell. This latter 
sampling will in effect reduce the noise in the relevant cells, thus making 
the noise dependence of cross-validation skill less straightforward to inter- 
pret. These dependencies should be tested in future work. Nevertheless, the 
preliminary results that I have outlined suggest that the MW11 hypothesis 
test is subject to Type II errors and thus is not suitable for evaluating the 
reliability of proxy archives as temperature predictors. 

Acknowledgments. I thank Alexey Kaplan for many insightful discus- 
sions on the subject of this manuscript and Blakeley McShane for helpful 
technical exchanges. 

SUPPLEMENTARY MATERIAL 

Code and data files (DOI: 10.1214/10-AOAS398BSUPP; .zip). This sup- 
plement comprises a zip file containing the code and data files used for this 
manuscript and a README document describing all the files included in 
the directory. 
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